Transformer model

The core architecture behind LLMs. It is built around the Attention mechanism.
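A minimal sketch of the attention operation at the core of the architecture (scaled dot-product attention; all names and shapes here are illustrative, not from any particular library):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d)) V for single-head attention."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n_q, n_k) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)     # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1
    return weights @ V                               # weighted sum of value vectors

# Toy example: 3 query tokens attending over 4 key/value tokens, dimension 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 8): one output vector per query token
```

A full transformer block stacks multi-head attention with layer normalization and a feed-forward network, but this single-head version is the essential computation.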

Google’s T5 paper provides a unified framework to understand and train transformer models.

Tutorials and reviews

Implementations

See also Implementations

https://huggingface.co/blog/how-to-train shows how to train a transformer model from scratch. See also How to pretrain transformer models and A complete Hugging Face tutorial: how to build and train a vision transformer.

The Genius of DeepSeek’s 57X Efficiency Boost [MLA]

Applications

Transformers are also used outside Language models, including Computer vision and Reinforcement learning (Decision transformer).

Internal workings

See Sanford2024transformers for the connection to Massively parallel computation.

Teh2025solving studies whether transformers can solve an empirical Bayes problem.

Cohen2025spectral studies how transformer models can predict the Shortest path on a graph.

Circuit analysis

Park2025does identifies temporal heads by performing circuit analysis.